Added a SAMTOOLS_PIPELINE to run multiple samtools commands at once #4571
Conversation
---

Love the idea

---

**mashehu** left a comment:

Please don't call it "pipeline", that is already too overloaded.

I am also not 100% sure about the universality of this. It feels like we are basically moving away from atomic modules again, and one needs to dig through several configs and docs to figure out what the module/command is actually doing.

---

Hi guys. You were quick 😉! I am still testing a few things locally. I was then going to fill in the body of the PR to explain how this came up and share the link on Slack for discussion. I know we already discussed `samtools sormadup` and decided against it, so I don't expect an easy ride 😅, but I think there are good reasons to still do something on this topic.
---

As someone whose cluster is currently running out of scratch space, I definitely support the idea of chaining multiple samtools operations into one module; producing lots of individual BAM/CRAM files uses a lot of space when not necessary.
---

I definitely approve of the idea. In terms of implementation, I would skip the for loop and only use if/else to make an array. I'm also starting to think we need to revisit how args are passed to modules: I would prefer that the args are indexed with the `tool_subtool` key in this case, for readability. I agree that the word "pipeline" is overloaded. I like the word "compose", but I'm not sure how to integrate it, as I think the purpose of the composition also needs to be clear in the name, since there could be quite a few compositions of a single tool after a time.
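As a sketch of that suggestion, a config fragment keying the args by `tool_subtool` might look like this (module name and flags are hypothetical, not taken from this PR):

```nextflow
// Hypothetical: ext.args keyed per samtools subcommand for readability,
// instead of positional args / args2 / args3.
process {
    withName: 'SAMTOOLS_SORMADUP' {
        ext.args = [
            'samtools_collate' : '-O -u',
            'samtools_fixmate' : '-m -u',
            'samtools_sort'    : '-u',
            'samtools_markdup' : '-S'
        ]
    }
}
```

The map form makes it immediately visible which flag belongs to which subcommand, at the cost of departing from the usual `args`/`args2` convention.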
I like the idea of …

---

I can't read properly on the phone ...

---

How does something very explicit, like …

---
I think it would be better if we separated these aggregate commands into their own folder, like subworkflows are. It makes them easier to find and highlights that they're not atomic. Perhaps call the folder …

---
I think there's too much emphasis on the (non-)atomicity of this. The nf-core guidelines don't prohibit multiple commands running in a module. There are already a lot of modules that run multiple commands (the best proxy is to search for mulled containers: they typically indicate that the module needs another tool from another package). In my mind, I map "modules" to Nextflow … As an alternative, what about naming this: …

---
While the guidelines don't prohibit multiple commands, the reason is that those modules would be impractical without them. Modules are supposed to be as simple as possible without making them impractical. There may also be examples where this isn't the case, but there are several modules that don't fully follow the guidelines. For findability, I do think this needs to be separated in some way. This kind of process definitely matches what subworkflows are supposed to be for. As it stands, there are also several modules that are only used in a certain sequence and would benefit from being aggregated into a single process. This is part of a larger issue to me.

---
I think this could slot into the …

---
Well, I did mention sub-workflows in my OP.

The first technical question is: can nf-core sub-workflows have "local" modules? I can't really see room for that in the repository.

---
@nf-core/core we need a decision on where to place this, because this is a necessity.

---
I'm all in favour of more piping; I think we need to use it much, much more to speed our pipelines up. This is a pretty cool solution, but fairly complex? I'm not sure it will be very obvious to a new user how to give it a go.

A number of modules that are defined by what they do is a simpler approach, fits more naturally with nf-core/modules, and is more approachable to new users. E.g., let's imagine … Clear, readable, quick to understand. In a pipeline it's obvious what it's doing: …

This is the current solution, which is fine but not obvious to a new developer?
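For illustration, the fixed-purpose alternative's script body would boil down to one hard-wired pipe. A minimal sketch, with illustrative flags and filenames (this is not the PR's actual code); the pipe is held in a string so the sketch stands alone without samtools installed:

```shell
# Hypothetical script body of a fixed-purpose "sormadup"-style module:
# one hard-wired samtools pipe instead of a configurable command list.
pipeline='samtools collate -O -u input.bam - \
    | samtools fixmate -m -u - - \
    | samtools sort -u - \
    | samtools markdup - output.bam'
echo "$pipeline"
```

A developer reading this module immediately knows what it does; the trade-off is one new module per useful combination of commands.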
---
@adamrtalbot . Then maybe #3310 could be reopened because it's exactly what you're advocating for ? This PR is for a follow-up module that allows enabling/disabling the components of the command, without having to write a new module for every combination. FYI my PR is currently only for BAM→BAM transformations. |
|
While this would be complex to a newcomer, they still have the option to stick to the more traditional path. I'm quite happy for processes like these to require their own documentation, because the resource benefits are significant. In my opinion, there is no way to have best-practice pipelines and also have everything newcomer friendly. We already do things that are not newcomer friendly, like the way some parameters are separated into …

These processes could also help improve developer skills by demonstrating practical examples of groovy tips and tricks, and better resource usage. For pipeline readability, the developer can alias the name to something more in line with what they intend to use the generalised process for. The rest of the complexity won't matter as long as it brings the benefits it was intended for.
I think so. We've got a bit dogmatic about "1 tool, 1 module", to the point of losing sight of why that rule exists. In my mind, a module should be a single unit of work, not a piece of software. It's often the case that these two are the same, but not always, and sometimes nf-core pipelines are absurdly inefficient because everything runs in its own module.
Then let's make things easier to use and more friendly. Not more complicated because reasons.
Nah, they'll just make a local module in their pipeline. To be fair, I quite like this module, think it's really good code, and love the flexibility. I'd happily accept this under the golden rule: Minimum Useful Thing™️, but I'm firmly in the camp that we should be simplifying and reducing nf-core, not adding complexity.
This would be a positive change, but is everyone in agreement (to generally allow multiple commands)?
Then I would argue we're not annotating modules well enough and not making them findable. I think to help others, it would be useful if we added more comments to modules explaining anything that might be complicated.
I think we're in agreement here to make things as simple as possible. I don't think anyone wants to add complexity unless it's a necessity.

---
FYI, there is already a module that runs samtools commands with a pipe: https://github.com/nf-core/modules/blob/master/modules/nf-core/samtools/collatefastq/main.nf

---
In my opinion, that's one of the modules that doesn't follow the guidelines and shouldn't be in the …

---
We shouldn't follow the guidelines if they cause people to create broken pipelines.

---
That's only for read pairs, where the input needs to be collated.

---
But I also hope none of our guidelines cause people to make broken pipelines. At the end of the day, it is up to the developer to check that their code is correct, hopefully through rigorous testing and peer review.

---
Can anybody add what was decided in the last maintainers meeting? I couldn't make it, but I'm very interested in either this or my …

---
See nf-core/website#2327 for a basic summary and a start on guidelines. @maxulysse is in charge of this. Basically, this PR might not be maintainable by the average nf-core user (most maintainers were not in favour of it), and maintainers would like the script to be explicit about the tool, in particular so `argsN` maps to a particular tool/subcommand (i.e. like yours). However, they're also worried about the maintenance burden, and since one can make local modules, these compound processes should likely go there unless they can be used across multiple pipelines. There's still space to provide more opinions, though.

---
Thanks @mahesh-panchal!

---
@matthdsm We (@muffato) are already using your implementation as a local module. It would be great to have this directly from nf-core.

---
Yes @matthdsm. In sanger-tol we've got sormadup already (almost exactly your version, the only change I did was removing the …

---
I'm in favor of resurrecting #3310; that sounds like a good candidate for a meta-module that will do wonders in terms of data footprint savings.

---
I was a dissenting voice on this: I like the flexibility of this "module" and would merge this in.
I've recently spent some time optimising the resource usage of some Nextflow pipelines, and I found that the samtools workflow to mark duplicates takes 5x longer to run and uses 3x more disk space when implemented the nf-core way, i.e. as 4 modules put together in a sub-workflow, compared to having a single local module with a bash pipeline.
Last time we discussed this sort of module (#3310 and https://nfcore.slack.com/archives/CJRH30T6V/p1682062426042809), it was decided against, though we didn't know the actual improvements at the time. But both @drpatelh and @SPPearce suggested reopening the discussion after seeing the figures. After all, the nf-core guidelines don't forbid it (bold type mine).
However, @drpatelh said that there should be options to disable some of the 4 commands, because not everyone may need them. Here is my take on this.
I introduce a new module that can run a bash pipeline (hence the name) of any commands the user wants, in any order, as long as they take BAM/CRAM as inputs and outputs. It wasn't as trivial as I thought it would be, because samtools commands have different ways of defining the input and output files 🙃
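To illustrate the dynamic construction, here is a minimal bash sketch of the general idea (my own sketch, not the module's actual code; the command list and flags are illustrative): a configurable command list plus a per-command args map are joined into a single samtools pipe.

```shell
# Hypothetical sketch: build one samtools pipe string from a
# configurable command list and a per-command args map.
commands=(collate fixmate sort markdup)
declare -A args=( [collate]='-O -u' [fixmate]='-m -u' [sort]='-u' )

pipeline=''
for cmd in "${commands[@]}"; do
    # Append this command's extra args only when some are configured
    part="samtools ${cmd}${args[$cmd]:+ ${args[$cmd]}}"
    # Join the parts with " | " (no separator before the first one)
    pipeline+="${pipeline:+ | }${part}"
done
echo "$pipeline"
# → samtools collate -O -u | samtools fixmate -m -u | samtools sort -u | samtools markdup
```

The real module additionally has to handle each subcommand's input/output file conventions, which is where most of the complexity lives.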
It works like this:
```nextflow
workflow {
    input = [
        [ id:'test', single_end:false ], // meta map
        file(params.test_data['sarscov2']['illumina']['test_paired_end_bam'], checkIfExists: true)
    ]
    commands = ['collate', 'fixmate', 'sort', 'markdup']
    SAMTOOLS_PIPELINE ( input, [[],[]], commands )
    SAMTOOLS_PIPELINE.out.output
}
```

```nextflow
process {
    withName: SAMTOOLS_PIPELINE {
        ext.args = [
            fixmate: '-m',
        ]
    }
}
```

What I like about the approach:
What I dislike about the implementation:

- … `task.ext.args${index}`. So instead I make `ext.args` a map where the key is the command name and the value is the string to add to the command-line. I don't know if/how closures can be used. I would very much prefer to use the regular `task.ext.args`, `task.ext.args2`, etc.
- The configuration is split between the `.nf` (`commands = ['collate', 'fixmate', 'sort', 'markdup']`) and the `.config` (`ext.args = [ ... ]`). Maybe I should move the definition of the commands to the `.config`?
- The construction of `.command.sh` is very fragile. What I've got works, but for instance I tried `pipeline_command += "samtools ${this_command} ${all_args[index]} \\ \n"` and that caused the `cat <<-END_VERSIONS > versions.yml` to be indented, which broke the generation of the yml.

What it needs:
- … `--reference` in the right place.

What it may need:
- … `process` block in the sub-workflow's `main.nf`.

PR checklist
- [ ] … `versions.yml` file.
- [ ] … `label`
- [ ] `PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware`
- [ ] `PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware`
- [ ] `PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware`